Day Three:
Transform

145 min approx

Overview

Questions

  • What are tidy data, why are they useful, and how can untidy data be transformed into tidy data?
  • How to select some variables/columns only?
  • How to filter rows/cases that match certain conditions?
  • How to modify (even create) the content of a (possibly new) variable?
  • How to handle factors effectively in R/Tidyverse?
  • How to handle dates and time in R/Tidyverse?
  • How to handle strings in R/Tidyverse?

Lesson Objectives

To be able to

  • Use the pivot_*, separate, and unite functions from the tidyr package in the Tidyverse to reshape data into a tidy form.
  • Select/filter columns/rows of tibbles (i.e., data frames).
  • Change the content of a variable programmatically, possibly using content from other variables.
  • Perform basic factor data management.
  • Convert textual dates/times into R date/time objects.
  • Use simple regular expressions and the main str_* functions to manage strings.

Data shape

Tidy data

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Untidy data

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Why tidy data

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Tidy rules

There are three interrelated rules that make a dataset tidy:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

Why untidy data

  • Data is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.

Example: the tidyr::billboard dataset.

library(tidyverse)

billboard

Warning

  • Information hidden in the column layout:
    • the column names wk1-wk76 are values of a single variable: the week.
    • the cell values of wk1-wk76 are values of a single variable: the rank.

Start Tidying - tidyr::pivot_longer

library(tidyverse)

billboard |> 
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

Important

  • tidyr::pivot_longer converts your data to a “longer” format.
  • cols: selects which columns should be pivoted.
  • names_to: names the new column that will hold the names of the cols columns.
  • values_to: names the new column that will hold the values of the cols columns.

Warning

Many rows hold missing values, which are possibly uninformative!

Start Tidying - tidyr::pivot_longer

library(tidyverse)

billboard |> 
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

Important

  • tidyr::pivot_longer converts your data to a “longer” format.
  • cols: selects which columns should be pivoted.
  • names_to: names the new column that will hold the names of the cols columns.
  • values_to: names the new column that will hold the values of the cols columns.
  • values_drop_na: decides whether rows with a missing value in the values_to column should be dropped.
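
After pivoting, the week is still stored as text ("wk1", "wk2", …). The following is a minimal sketch of one way to turn it into a number, assuming a hypothetical toy tibble (toy_chart) shaped like billboard and using readr::parse_number; it is an illustration, not part of the lesson scripts.

```r
library(tidyverse)

# Hypothetical toy data shaped like billboard: one column per week.
toy_chart <- tibble(
  track = c("song_a", "song_b"),
  wk1   = c(10, 40),
  wk2   = c(8, NA)
)

toy_long <- toy_chart |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  ) |>
  # parse_number() keeps only the numeric part: "wk1" becomes 1
  mutate(week = parse_number(week))

toy_long
```

The row for song_b in week 2 is dropped because its rank is missing.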

Selectors

  • var1:var10: variables lying between var1 on the left and var10 on the right.

  • starts_with("a"): names that start with “a”.

  • ends_with("z"): names that end with “z”.

  • contains("b"): names that contain “b”.

  • matches("x.y"): names that match the regular expression x.y.

  • num_range("x", 1:4): names following the pattern x1, x2, …, x4.

  • all_of(vars)/any_of(vars): names stored in the character vector vars. all_of(vars) will error if any of the variables isn’t present; any_of(vars) will match just the variables that exist.

  • everything(): all variables.

  • last_col(): furthest column on the right.

  • where(is.numeric): all variables where is.numeric() returns TRUE.

Tip

  • !selection: only variables that don’t match selection.

  • selection1 & selection2: only variables included in both selection1 and selection2.

  • selection1 | selection2: all variables that match either selection1 or selection2
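
To make the selectors concrete, here is a small sketch on a hypothetical tibble (toy, with made-up column names). It uses dplyr::select only to show which names each selector picks; the same selectors can be used in the cols argument of pivot_longer.

```r
library(tidyverse)

# Hypothetical columns, chosen only to exercise the selectors.
toy <- tibble(
  id = 1:2,
  x1 = c(0.5, 1.5),
  x2 = c(2, 3),
  label_a = c("u", "v"),
  label_b = c("w", "z")
)

sel_range  <- toy |> select(num_range("x", 1:2)) |> names()
sel_prefix <- toy |> select(starts_with("label")) |> names()
# numeric columns, except id
sel_and    <- toy |> select(where(is.numeric) & !id) |> names()
# id, plus anything ending in "_b"
sel_or     <- toy |> select(id | ends_with("_b")) |> names()

sel_range
sel_prefix
sel_and
sel_or
```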

Multiple variables in column names

who2

Tip

When each column name encodes multiple variables, you can pivot the columns while keeping the combined names in a single column. You can then separate them in a second step…

who2 |> 
  pivot_longer(
    cols = !(country:year),
    names_to = "diagnosis_gender_age", 
    values_to = "count"
  )

The second step can be done with tidyr::separate, which splits the combined column into one column per variable.

who2 |> 
  pivot_longer(
    cols = !(country:year),
    names_to = "diagnosis_gender_age", 
    values_to = "count"
  ) |> 
  separate(
    col = diagnosis_gender_age,
    into = c("diagnosis", "gender", "age"),
    sep = "_"
  )

who2 |> 
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"), 
    names_sep = "_",
    values_to = "count"
  )

Tip

You can also split column names that contain multiple variables and follow a regular pattern into multiple variables in a single step, using names_to together with names_sep.
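
The lesson objectives also mention tidyr::unite, which is the inverse of separate: it pastes several columns into one. Below is a minimal sketch on a hypothetical tibble (toy_split) whose values imitate the who2 naming scheme; the round trip through separate is only there to show that the two verbs mirror each other.

```r
library(tidyverse)

# Hypothetical values imitating the diagnosis/gender/age pieces of who2.
toy_split <- tibble(
  diagnosis = c("sp", "sn"),
  gender = c("m", "f"),
  age = c("014", "1524")
)

# Paste the three columns into one, using "_" as separator.
toy_united <- toy_split |>
  unite(col = "diagnosis_gender_age", diagnosis, gender, age, sep = "_")

toy_united$diagnosis_gender_age

# Split it back into the three original columns.
round_trip <- toy_united |>
  separate(
    col = diagnosis_gender_age,
    into = c("diagnosis", "gender", "age"),
    sep = "_"
  )

round_trip
```

By default unite removes the input columns, so toy_united has a single column.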

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad) and write your questions and doubts there (and tell me if I am too slow or too fast).

ME: Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 11-pivoting.R

Your turn

…and:

  1. Answer in the pad, with an “x” next to the correct answers. What are the main options for pivot_longer?
  • names_from
  • names_to
  • values_from
  • values_to
  2. Then, open the script 10-pivot_longer.R and follow the instructions step by step.
05:00

Important

To transform a table to a longer one, you need to send some of its column names_to a new column, and their corresponding values_to another one! Optionally, drop the rows with missing values using values_drop_na.

tidyr::pivot_wider

Image from Data Carpentry’s R for Social Scientists

Reverse pivot - tidyr::pivot_wider

Animation of tidyverse verbs by Garrick Aden-Buie

Reverse pivot - example

library(tidyverse)
library(janitor)

bb_pivoted_twice <- billboard |> 
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  ) |>
  pivot_wider(
    names_from = "week",
    values_from = "rank" 
  )

all.equal(
  billboard |> remove_empty("cols"),
  bb_pivoted_twice |> remove_empty("cols")
)
[1] TRUE

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad) and write your questions and doubts there (and tell me if I am too slow or too fast).

ME: Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 11-pivoting.R

Your turn

…and:

  1. Answer in the pad, with an “x” next to the correct answers. What are the main options for pivot_wider?
  • names_from
  • names_to
  • values_from
  • values_to
  2. Then, open the script 11-pivot_wider.R and follow the instructions step by step.
05:00

Important

To transform a table to a wider one, you need to take the new column names_from an existing column, and their corresponding values_from the associated one! The missing values this may create can be replaced using values_fill.
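
As a sketch of how widening can create missing cells, consider a hypothetical long tibble (toy_long) in which one track has no entry for the second week. The fill value 0 is an assumption made purely for illustration; for real ranks, leaving NA is usually the more honest choice.

```r
library(tidyverse)

# Hypothetical long data: song_b has no row for wk2.
toy_long <- tibble(
  track = c("song_a", "song_a", "song_b"),
  week  = c("wk1", "wk2", "wk1"),
  rank  = c(10, 8, 40)
)

# The missing combination becomes NA in the wide table.
toy_wide <- toy_long |>
  pivot_wider(names_from = week, values_from = rank)

# values_fill replaces the cells created by the missing combinations.
toy_wide_filled <- toy_long |>
  pivot_wider(names_from = week, values_from = rank, values_fill = 0)

toy_wide
toy_wide_filled
```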

Data management

dplyr - intro

Common structure:

  • The first argument is always a data frame
  • The subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).
  • The output is always a new data frame.

Tip

All verbs in the Tidyverse are designed to do mainly one thing, and to do it well! So, to solve complex problems we will often combine multiple verbs, using the pipe (|>) we are already familiar with!

Rows - dplyr::filter

Important

dplyr::filter allows you to keep rows based on the values of the columns.

library(tidyverse)
library(here)
library(rio)

db <- here("data-raw", "Copenhagen_clean.xlsx") |> 
  import(setclass = "tibble")

db |> 
  filter(age < 18)

Rows - conditions

We can use any kind of condition inside dplyr::filter; e.g.,

And

db |> 
  filter((age < 18) & case)

Tip

If a variable is already logical, you can use it directly as a condition! E.g.

db |> 
  filter(case) ## instead of case == TRUE

db |> 
  filter(!case) ## instead of case == FALSE

Or

db |> 
  filter(gastrosymptoms | ate_anything)

In

db |> 
  filter(age %in% 19:25)

Not equal

db |> 
  filter(group != "student")

Rows - multiple conditions

We can also combine multiple conditions of arbitrary complexity at once

db |> 
  filter(!((age < 18) & case))

Tip

It can be difficult to remember the precedence of logical operators. Grouping each condition in parentheses is a safe way to avoid mistakes!
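
A short sketch of why the parentheses matter: in R, & is evaluated before |, so the same three conditions can select different rows depending on grouping. The tibble toy and its columns (age, case, staff) are hypothetical and not taken from the Copenhagen data.

```r
library(tidyverse)

# & is evaluated before |, so these two expressions differ.
TRUE | FALSE & FALSE     # read as TRUE | (FALSE & FALSE)
(TRUE | FALSE) & FALSE

# Hypothetical data with two logical columns.
toy <- tibble(
  age = c(16, 30, 40),
  case = c(TRUE, FALSE, TRUE),
  staff = c(FALSE, TRUE, FALSE)
)

# Read as: staff | ((age < 18) & case)
without_parens <- toy |> filter(staff | age < 18 & case)

# Explicit grouping: (staff or minor) and case
with_parens <- toy |> filter((staff | age < 18) & case)

without_parens
with_parens
```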

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad) and write your questions and doubts there (and tell me if I am too slow or too fast).

ME: Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 12-filter-and-select.R

Your turn

…and:

  1. Imagine you have imported a db with a variable age, and you want to keep the rows with age equal to 18 or 21. Before evaluating it, does the following code return what you need? Answer in the pad, under the section 3.2. Ex20.
library(tidyverse)

db |>
  filter(age == 18 | 21)
  2. Then, open the script 12-filter.R, and follow the instructions step by step.
02:00

Important

  • You can use arbitrarily complex conditions, involving any column of the data frame in use, as long as they return a logical vector with one element per row of the data frame.
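
Coming back to the exercise above, this sketch shows on a hypothetical four-row tibble (toy) what age == 18 | 21 actually does, next to two ways of writing the intended condition.

```r
library(tidyverse)

toy <- tibble(age = c(17, 18, 21, 30))

# 21 on its own is coerced to TRUE, so the condition is TRUE for every row.
kept_all <- toy |> filter(age == 18 | 21)

# Two equivalent ways to ask for "age is 18 or 21".
kept_or <- toy |> filter(age == 18 | age == 21)
kept_in <- toy |> filter(age %in% c(18, 21))

nrow(kept_all)
kept_or
kept_in
```

The %in% form scales better when the list of accepted values grows.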

Columns - dplyr::select

For analyses, you do not need to remove columns from your dataset, but it can be extremely useful to look only at the data you need from time to time.

You can select the columns to keep using the dplyr::select() verb, providing:

The variables you would like to keep

library(tidyverse)

db |> 
  select(sex, age, case)

A range of variables you would like to keep

library(tidyverse)

db |> 
  select(sex:class)

All variables except a selection (!)

library(tidyverse)

db |> 
  select(!diarrhoea:jointpain)

Variables matching a condition (where)

library(tidyverse)

db |> 
  select(where(is.logical))

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad) and write your questions and doubts there (and tell me if I am too slow or too fast).

ME: Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 12-filter-and-select.R

Your turn

…and:

  1. Before evaluating it, in the pad, under the section 3.2. Ex21, write (each on a new line) all the ways you can think of to select the variables sex, age, and group using dplyr::select from our data frame db imported from Copenhagen_clean.xlsx.

  2. What do you expect the following code to return (possibly an error)?

db |>
  select(any_of(c("age", "foo")))
  3. Then, open the script 13-select.R and follow the instructions step by step.
05:00

Important

  • all_of(vec) is for strict selection. If any of the variables in the character vec is missing, an error is thrown.
  • any_of(vec) doesn’t check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed.
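
The difference between the two helpers can be sketched on a hypothetical two-column tibble (toy); the vector wanted deliberately contains a name, "foo", that does not exist. The error is caught with tryCatch only so that the example runs to the end.

```r
library(tidyverse)

toy <- tibble(age = c(20, 35), sex = c("f", "m"))
wanted <- c("age", "foo")   # "foo" is not a column of toy

# any_of() silently ignores the names that do not exist.
lenient <- toy |> select(any_of(wanted))

# all_of() throws an error; here we catch it and keep a short message.
strict <- tryCatch(
  toy |> select(all_of(wanted)),
  error = function(e) "error: some columns in `wanted` do not exist"
)

# Negative selection: drop the listed columns if they are there.
dropped <- toy |> select(!any_of(wanted))

names(lenient)
strict
names(dropped)
```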

Mutate

We can also add new columns which are calculated from existing ones.

We can use simple algebra

library(tidyverse)

db |> 
  # select just to return few results
  select(id, incubation) |> 
  mutate(
    incubation_days = incubation / 24
  )

We can use functions on variables

library(tidyverse)

db |> 
  # select just to return few results
  select(id, incubation) |> 
  mutate(
    incubation_norm = (
      incubation - mean(incubation, na.rm = TRUE)
    ) / sd(incubation, na.rm = TRUE) 
  )

We can use variables just created

library(tidyverse)

db |> 
  # select just to return few results
  select(id, age, group, class, case) |> 
  mutate(
    adult = (age > 18) & (
      (group != "student") |
      is.na(class)
    ),
    adult_case = adult & case
  )

Pay attention to vectorized vs. summary functions

library(tidyverse)

sample_df <- tibble(
  x = c(1, 5, 7),
  y = c(3, 2, NA)
)

sample_df |> 
  mutate(
    # rows element-wise
    min_vec = pmin(x, y, na.rm = TRUE),
    max_vec = pmax(x, y, na.rm = TRUE),
    # cols global
    min_all = min(x, y, na.rm = TRUE),
    max_all = max(x, y, na.rm = TRUE),
  )

Warning

  • Summary functions (e.g., min, max):
    • Takes: vectors.
    • Returns: a single value.
  • Vectorized functions (e.g., pmin, pmax):
    • Takes: vectors.
    • Returns: vectors (the same length as the input).

Conditional mutates - Binary: dplyr::if_else

To mutate a variable according to a binary condition

library(tidyverse)

db |> 
  mutate(
    age_class = if_else(
      age >= 18,
      "adult",
      "child"
    )
  ) |> 
  select(age, age_class)

Important

dplyr::if_else requires the values returned for TRUE and for FALSE to have compatible types.
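
A minimal sketch of what the type requirement means in practice, on a hypothetical age vector with one missing value. It also shows the missing argument of if_else, which sets the result where the condition is NA. The failing call is wrapped in tryCatch so that the example keeps running.

```r
library(tidyverse)

age <- c(12, 30, NA)

# Both outcomes are character: fine. NA in the condition gives NA.
plain <- if_else(age >= 18, "adult", "child")

# `missing` sets the value used where the condition is NA.
with_missing <- if_else(age >= 18, "adult", "child", missing = "unknown")

# Character for TRUE and number for FALSE: if_else refuses to mix them.
mixed <- tryCatch(
  if_else(age >= 18, "adult", 0),
  error = function(e) "error: incompatible types"
)

plain
with_missing
mixed
```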

Conditional mutates - Subsequent: dplyr::case_when

To mutate a variable according to multiple subsequent conditions

library(tidyverse)

db |> 
  mutate(
    age_class = case_when(
      age >  24 ~ "adult (prof)",
      age >= 18 ~ "adult (stud)",
      age >= 15 ~ "young (stud)",
      TRUE      ~ "child"
    )
  ) |> 
  select(age, age_class)

Important

dplyr::case_when takes condition ~ value pairs. condition must be a logical vector; when it’s TRUE, the value will be used.

  • If none of the cases match, the output gets an NA.
  • Conditions are considered in order, so you should put the most specific case first!
  • TRUE ~ <default_value> is used to specify the “default”/catch-all value.
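
To see why the order of the conditions matters, the sketch below applies the same four rules in two different orders to a hypothetical age vector. Note also what happens to the missing age: every comparison is NA, so it falls through to the catch-all.

```r
library(tidyverse)

age <- c(10, 16, 20, 30, NA)

# Most specific condition first: each age gets its intended class.
right_order <- case_when(
  age >  24 ~ "adult (prof)",
  age >= 18 ~ "adult (stud)",
  age >= 15 ~ "young (stud)",
  TRUE      ~ "child"
)

# Least specific first: age >= 15 captures everyone aged 15 or more,
# so the two following conditions are never reached.
wrong_order <- case_when(
  age >= 15 ~ "young (stud)",
  age >= 18 ~ "adult (stud)",
  age >  24 ~ "adult (prof)",
  TRUE      ~ "child"
)

right_order
wrong_order
```

If missing ages should stay missing, a first rule such as is.na(age) ~ NA_character_ can be added. In recent dplyr releases (1.1.0 and later, as far as we know) the argument .default can be used in place of TRUE ~ for the catch-all value.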

Grouping and summarizing

We can group rows by one or more variables into groups that are meaningful for our analysis, and then summarize each group into a single row by performing a summary operation.

library(tidyverse)

db |> 
  group_by(class) |> 
  summarize(
    mean_age = age |> 
      mean(na.rm = TRUE),
    
    n = n(),
    
    n_teachers = sum(
      group == "teacher",
      na.rm = TRUE
    )
  )

Counts

If we want to count the number of rows in each group, we can simply use dplyr::count instead of dplyr::group_by and dplyr::summarize.

db |> 
  count(class)
db |> 
  count(class, group)
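
The equivalence can be checked on a hypothetical five-row tibble (toy): count gives the same numbers as group_by plus summarize with n(), and its sort argument puts the largest groups first.

```r
library(tidyverse)

# Hypothetical data, not the Copenhagen dataset.
toy <- tibble(
  class = c("a", "a", "b", "b", "b"),
  group = c("student", "teacher", "student", "student", "teacher")
)

# The long way: group, then count the rows of each group.
by_hand <- toy |>
  group_by(class) |>
  summarize(n = n())

# The short way.
by_count <- toy |> count(class)

# Largest groups first.
by_count_sorted <- toy |> count(class, sort = TRUE)

by_hand
by_count
by_count_sorted
```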

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad) and write your questions and doubts there (and tell me if I am too slow or too fast).

ME: Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 13-transforming.R

Your turn

…and:

  1. Before trying it, in the pad, under the section 3.2. Ex22, write your guess about the result of using dplyr::mutate to assign to the name of an already existing variable. E.g.
library(tidyverse)

db |> 
  mutate(
    age = age * 365.25
  )
  2. Then, open the script 14-mutate.R and follow the instructions step by step.
05:00

Important

Like all the other verbs in the Tidyverse, dplyr::mutate

  • always takes a data frame as input.
  • always returns a data frame as output.
  • never changes its input.
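
The last point is worth seeing once. In this sketch, on a hypothetical two-row tibble (toy), the result of mutate is lost unless it is assigned to a name, and the original object stays untouched either way.

```r
library(tidyverse)

toy <- tibble(id = 1:2, age = c(1, 2))

# The result is printed, not stored: toy itself is unchanged.
toy |> mutate(age = age * 365.25)
toy$age

# To keep the result, assign it (to a new name, or back to toy).
toy_days <- toy |> mutate(age = age * 365.25)
toy_days$age
```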

Manage principal formats

Factors - why

Using strings for categories is not always the best choice. Factors are the best way to represent categories in R.

  • sorting issues
x1 <- c("Dec", "Apr", "Jan", "Mar")
sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"
  • missing/wrong levels issues
x2 <- c("Dec", "Apr", "Jam", "Mar")
x2
[1] "Dec" "Apr" "Jam" "Mar"
  • tabulation issues
table(x1)
x1
Apr Dec Jan Mar 
  1   1   1   1 

Factors - how

Define a set of possible values (levels), as a standard character vector.

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
month_levels
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct"
[11] "Nov" "Dec"

And define a variable as factor, specifying the levels using that.

Base

y1_base <- factor(x1, levels = month_levels)
y1_base
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Tidyverse - (forcats)

library(tidyverse)

y1_tidy <- fct(x1, levels = month_levels)
y1_tidy
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1_base)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1_tidy)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Factors - why tidyverse ({forcats})

If we don’t provide explicit levels, the levels are the unique values in the vector, sorted alphabetically in base R, or in the order of appearance in forcats.

Base

factor(x1)
[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar

Tidyverse - (forcats)

fct(x1)
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar



If some values are not among the levels (e.g., because of a typo), base R silently converts them to missing (NA), while forcats throws an (informative!) error.

y2_base <- x2 |> 
  factor(levels = month_levels)
y2_base
[1] Dec  Apr  <NA> Mar 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
y2_tidy <- x2 |> 
  fct(levels = month_levels)
Error in `fct()`:
! All values of `x` must appear in `levels` or `na`
ℹ Missing level: "Jam"

Factors - reorder levels

It can be useful to reorder levels, e.g., when plotting information.

We can use forcats::fct_reorder to reorder levels. Its first argument is the factor to reorder, and the second is a numeric vector whose values determine the new order of the levels.

Tip

Often, the numerical value you use to reorder a factor is another variable in your dataset!

library(tidyverse)

# sample dataset from `{forcats}`
# ?gss_cat for information
gss_cat 

relig_summary <- gss_cat |> 
  group_by(relig) |> 
  summarize(
    tv_hours = tvhours |> 
      mean(na.rm = TRUE)
  )
relig_summary

relig_summary |> 
  ggplot(aes(
    x = tv_hours,
    y = relig
  )) +
  geom_point()

relig_summary |> 
  ggplot(aes(
    x = tv_hours,
    y = relig |> 
      fct_reorder(tv_hours)
  )) +
  geom_point()
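
What fct_reorder does to the levels can be inspected without a plot. This is a minimal sketch on a hypothetical three-row tibble (toy) that borrows the column names used above; the values are made up.

```r
library(tidyverse)

toy <- tibble(
  relig = fct(c("a", "b", "c")),
  tv_hours = c(3, 1, 2)
)

# Original order: order of appearance.
original_levels <- levels(toy$relig)

# Levels sorted by increasing tv_hours.
reordered_levels <- levels(fct_reorder(toy$relig, toy$tv_hours))

# Levels sorted by decreasing tv_hours.
descending_levels <- levels(
  fct_reorder(toy$relig, toy$tv_hours, .desc = TRUE)
)

original_levels
reordered_levels
descending_levels
```

Only the order of the levels changes; the values stored in the factor stay the same.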

Factors - reorder levels

There are also many other useful functions in forcats to reorder levels, e.g., fct_infreq and fct_rev. To see all of them, refer to its website https://forcats.tidyverse.org/.

gss_cat |>
  mutate(
    marital = marital |>
      # order by frequency
      fct_infreq() |>
      # reverse the order
      fct_rev()
  ) |>
  ggplot(aes(x = marital)) +
  geom_bar()

Factors - modify (AKA recode) levels

We can also modify levels, e.g., to change the wording, or to merge some of them together.

Change the wording

gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  ) |>
  count(partyid)

Important

forcats::fct_recode will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.

To combine groups, you can assign multiple old levels to the same new level, or… use forcats::fct_collapse!

Merge levels together

gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad) and write your questions and doubts there (and tell me if I am too slow or too fast).

ME: Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 14-factors.R

Your turn

…and:

  1. Before evaluating it, in the pad, under the section 3.2. Ex23, write (on a new line) how you can convert the variable sex (which is a character as imported) to a factor using forcats::fct, within our data frame db imported from Copenhagen_clean.xlsx (possibly reassigning the modified data frame to the same name).

  2. Then, open the script 15-factors.R and follow the instructions step by step.

05:00

Tip

Factors, created with base R’s factor or with forcats::fct, are the best way to represent categories in R. The two work similarly, but forcats is more informative and more flexible.

Dates and Time

In the Tidyverse, the main package to manage dates and time is lubridate.

Remember

  • Dates are counts (stored as doubles; see ?double) of days since 1970-01-01.
  • Date-times are counts (stored as doubles; see ?double) of seconds since 1970-01-01 00:00:00 UTC.

To get the current date or date-time you can use today() or now():

library(tidyverse)
today()
now()
[1] "2024-01-15"
[1] "2024-01-15 12:24:54 CET"

Dates and Time - conversion from strings

Dates

ymd("2017-01-31")
mdy("January 31st, 2017")
dmy("31-Jan-2017")
[1] "2017-01-31"
[1] "2017-01-31"
[1] "2017-01-31"

Date-times

ymd_hms("2017-01-31 20:11:59")
mdy_hm("01/31/2017 20:11")
mdy_h("01/31/2017 20")

# Force date-time supplying a timezone
ymd("2017-01-31", tz = "UTC")
[1] "2017-01-31 20:11:59 UTC"
[1] "2017-01-31 20:11:00 UTC"
[1] "2017-01-31 20:00:00 UTC"
[1] "2017-01-31 UTC"

Date <-> Date-time conversion

as_datetime(today()) |> 
  str()
as_date(now()) |> 
  str()
 POSIXct[1:1], format: "2024-01-15"
 Date[1:1], format: "2024-01-15"
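
Because dates and date-times are just counts, we can look at the underlying numbers and do arithmetic with them. The dates below are arbitrary examples chosen to keep the numbers small.

```r
library(tidyverse)

# Days since 1970-01-01.
days_count <- as.numeric(ymd("1970-01-10"))

# Seconds since 1970-01-01 00:00:00 UTC (ymd_hms parses as UTC by default).
seconds_count <- as.numeric(ymd_hms("1970-01-01 00:01:00"))

# Subtracting two dates gives a time difference in days.
onset <- ymd("2017-01-31")
first <- ymd("2017-01-01")
elapsed_days <- as.numeric(onset - first)

# Adding a period of 7 days moves the date forward by one week.
one_week_later <- onset + days(7)

days_count
seconds_count
elapsed_days
one_week_later
```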

Extracting/Changing components

We can extract or modify components from date/date-time objects using:

  • year()
  • month()
  • day()
  • hour()
  • minute()
  • second()
  • wday() (day of the week)
  • yday() (day of the year)
  • week() (week of the year)
  • quarter() (quarter of the year).

Extract

(today_now <- now())
year(today_now)
month(today_now)
day(today_now)
hour(today_now)
minute(today_now)
second(today_now)
wday(today_now)
yday(today_now)
week(today_now)
quarter(today_now)
[1] "2024-01-15 12:24:54 CET"
[1] 2024
[1] 1
[1] 15
[1] 12
[1] 24
[1] 54.28816
[1] 2
[1] 15
[1] 3
[1] 1

Change

year(today_now)  <- 2020
today_now
month(today_now) <- 12
today_now
day(today_now) <- 30
today_now
hour(today_now) <- 17
today_now
minute(today_now) <- 14
today_now
second(today_now) <- 56
today_now
[1] "2020-01-15 12:24:54 CET"
[1] "2020-12-15 12:24:54 CET"
[1] "2020-12-30 12:24:54 CET"
[1] "2020-12-30 17:24:54 CET"
[1] "2020-12-30 17:14:54 CET"
[1] "2020-12-30 17:14:56 CET"
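
One detail of wday() is easy to miss: the number it returns depends on which day is considered the first of the week. The sketch below uses the date from the output above (2024-01-15, a Monday) and assumes the lubridate default, where the week starts on Sunday unless the option lubridate.week.start says otherwise.

```r
library(tidyverse)

d <- ymd("2024-01-15")   # a Monday

# Default: the week starts on Sunday, so Monday is day 2.
sunday_start <- wday(d)

# With week_start = 1 the week starts on Monday, so Monday is day 1.
monday_start <- wday(d, week_start = 1)

sunday_start
monday_start

# label = TRUE returns the (locale-dependent) name of the day instead.
wday(d, label = TRUE)
```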

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad) and write your questions and doubts there (and tell me if I am too slow or too fast).

ME: Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 15-date-time.R

Your turn

…and:

  1. Before evaluating it, in the pad, under the section 3.2. Ex24, write (on a new line) how you can convert the variable dayonset (which is a date-time as imported) to pure Date format using {lubridate} functions, within our data frame db imported from Copenhagen_clean.xlsx (possibly reassigning the modified data frame to the same name).

  2. Then, open the script 16-date-time.R and follow the instructions step by step.

05:00

Tip

Date and Date-time are counts of days/seconds since 1970-01-01. Managing them in R is not easy, but lubridate makes it easier.

Strings - Regular Expressions

Regular expressions are a powerful tool for matching text patterns. They are used in many programming languages to find and manipulate strings; in the Tidyverse, the stringr package provides the functions that work with them.

Base syntax for regular expressions

  • . matches any character
  • * matches zero or more times
  • + matches one or more times
  • ? matches zero or one time
  • ^ matches the start of a string
  • $ matches the end of a string
  • [] matches any one of the characters inside
  • [^] matches any character not inside the square brackets
  • | matches the pattern either on the left or the right
  • () groups part of a pattern, e.g., to delimit the alternatives of | or to apply a quantifier to the whole group

Example

The following match any string that:

  • a contains a (try str_view("banana", "a"))
  • ^a starts with a
  • a$ ends with a
  • ^a$ is exactly a
  • ^a.*a$ starts and ends with a, with any number of characters in between
  • ^a.+a$ starts and ends with a, with at least one character in between
  • ^a[bc]+a$ starts and ends with a, with at least one b or c (and nothing else) in between
  • ^a(b|c)d$ starts with a, followed by either b or c, followed by an ending d.
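
The patterns above can be tried with str_detect, which returns TRUE for each string that matches. The vector words is a hypothetical set of test strings chosen so that every pattern matches some of them and rejects others.

```r
library(tidyverse)

words <- c("a", "aba", "abca", "abd", "acd", "banana")

# Exactly "a".
exactly_a <- str_detect(words, "^a$")

# Starts and ends with a (needs two distinct a's, so "a" alone fails).
a_to_a <- str_detect(words, "^a.*a$")

# Only b's and c's between the two a's.
a_bc_a <- str_detect(words, "^a[bc]+a$")

# a, then b or c, then d.
a_b_or_c_d <- str_detect(words, "^a(b|c)d$")

exactly_a
a_to_a
a_bc_a
a_b_or_c_d
```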

Tip

To match special characters, you need to escape them with a double backslash (\\). I.e., you need to use \\., \\*, \\+, \\?, \\^, \\$, \\[, \\], \\|, \\(, \\).

To match a backslash, you need \\\\.
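
A small sketch of why escaping matters, using two hypothetical strings: without the escape, the dot matches any character; with it, only a literal dot. stringr::fixed() is an alternative that treats the whole pattern as plain text.

```r
library(tidyverse)

values <- c("3.14", "3x14")

# Unescaped: "." matches any character, so both strings match.
unescaped <- str_detect(values, "3.14")

# Escaped: "\\." matches only a literal dot.
escaped <- str_detect(values, "3\\.14")

# fixed() switches regular expressions off for this pattern.
literal <- str_detect(values, fixed("3.14"))

unescaped
escaped
literal
```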

Strings - {stringr}

The stringr package provides a consistent set of functions for working with strings, and it is designed to work consistently with the pipe.

Functions

  • str_detect(): does a string contain a pattern?
  • str_which(): which strings match a pattern?
  • str_subset(): subset of strings that match a pattern
  • str_sub(): extract a substring by position
  • str_replace(): replace the first match with a replacement
  • str_replace_all(): replace all matches with a replacement
  • str_remove(): remove the first match
  • str_remove_all(): remove all matches
  • str_split(): split up a string into pieces
  • str_extract(): extract the first match
  • str_extract_all(): extract all matches
  • str_locate(): locate the first match
  • str_locate_all(): locate all matches
  • str_count(): count the number of matches
  • str_length(): the number of characters in a string

Tip

Because all stringr functions start with str_, in RStudio you can type str_ and then press TAB to see all the available functions.

Examples

library(tidyverse)

x <- c("apple", "banana", "pear")
str_detect(x, "[aeiou]")
[1] TRUE TRUE TRUE
str_which(x, "[aeiou]")
[1] 1 2 3
str_subset(x, "[aeiou]")
[1] "apple"  "banana" "pear"  
str_sub(x, 1, 3)
[1] "app" "ban" "pea"
str_replace(x, "[aeiou]", "x")
[1] "xpple"  "bxnana" "pxar"  
str_replace_all(x, "[aeiou]", "x")
[1] "xpplx"  "bxnxnx" "pxxr"  
str_remove(x, "[aeiou]")
[1] "pple"  "bnana" "par"  
str_remove_all(x, "[aeiou]")
[1] "ppl" "bnn" "pr" 
str_split(x, "[aeiou]")
[[1]]
[1] ""    "ppl" ""   

[[2]]
[1] "b" "n" "n" "" 

[[3]]
[1] "p" ""  "r"
str_extract(x, "[aeiou]")
[1] "a" "a" "e"
str_extract_all(x, "[aeiou]")
[[1]]
[1] "a" "e"

[[2]]
[1] "a" "a" "a"

[[3]]
[1] "e" "a"
str_locate(x, "[aeiou]")
     start end
[1,]     1   1
[2,]     2   2
[3,]     2   2
str_locate_all(x, "[aeiou]")
[[1]]
     start end
[1,]     1   1
[2,]     5   5

[[2]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     6   6

[[3]]
     start end
[1,]     2   2
[2,]     3   3
str_count(x, "[aeiou]")
[1] 2 3 2
str_length(x)
[1] 5 6 4

Strings - concatenate

  • str_c: takes any number of vectors as arguments and returns a character vector of the concatenated values.

  • str_glue: takes a string and interpolates values into it.

library(tidyverse)

tibble(
    x = c("apple", "banana", "pear"),
    y = c("red", "yellow", "green"),
    z = c("round", "long", "round")
  ) |> 
  mutate(
    fruit = str_c(x, y, z),
    fruit_space = str_c(x, y, z, sep = " "),
    fruit_comma = str_c(x, y, z, sep = ", "),
    fruit_glue = str_glue("I like {x}, {y} and {z} fruits")
  )

My turn

YOU: Connect to our pad (https://bit.ly/ubep-rws-pad) and write your questions and doubts there (and tell me if I am too slow or too fast).

ME: Connect to the Day-3 project in RStudio cloud (https://bit.ly/ubep-rws-rstudio): script 16-strings.R

Your turn

…and:

  1. Before evaluating it, in the pad, under the section 3.2. Ex25, write (on a new line) how you can match all file names that are R scripts (i.e., ending with .r or .R). Report your proposal for a regular expression.

  2. Then, open the script 17-strings.R and follow the instructions step by step.

05:00

Tip

  • All functions in stringr start with str_, so you can type str_ and then press TAB to see all the available functions.

  • You can use str_view to see how a regular expression matches a string.

  • str_glue is a powerful tool to concatenate strings and variables.

Homework

Posit’s RStudio Cloud Workspace

Instructions

  • Go to: https://bit.ly/ubep-rws-rstudio

Your turn

  • Project: day-3
  • Instructions:
    • Go to: https://bit.ly/ubep-rws-website
    • The text is the Day-3 assessment under the tab “Summative Assessments”.
    • (on RStudio Cloud) homework/day_three-summative.html
  • Script to complete: homework/solution.R

Acknowledgment

To create the current lesson, we explored, used, and adapted content from the following resources:

The slides are made using Posit’s Quarto open-source scientific and technical publishing system powered in R by Yihui Xie’s knitr.

Additional resources

License

This work by Corrado Lanera, Ileana Baldi, and Dario Gregori is licensed under CC BY 4.0
